Tidy Data, Archives, Metadata

Week 3 of Data Science for NGA LTER REU Students

Liz Dobbins

NGA LTER / Axiom Data Science

2023-08-03

Last Week

  1. Intro to Programming (Python)
  2. Programming Best Practices
  3. Practiced Best Practices

Best Practices Solution

Data Life Cycle

Data Life Cycle. DataONE Best Practices

Signature Data Example

  • Plan: LTER has extensive data management
  • Collect: Multiple PIs, many years
  • Assure: Best quality. Nicely formatted.
  • Describe: There is some metadata on the NGA website
  • Preserve: Website includes links to archive
  • Discover: Informal
  • Integrate: Future possibilities
  • Analyze: Python/pandas

Julie Lowndes and Allison Horst

Definition of “Tidy Data”

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10):1-23. https://doi.org/10.18637/jss.v059.i10.

Example of Messy Data

A table of weights:

Plot  SpeciesA  SpeciesB
1     3.5       1.2
2     2.8       4.2


  • the variable weight is spread across multiple columns
  • SpeciesA and SpeciesB are values of a single variable (species)
    • values of a variable should not be used as column headers

Same Data, Now Tidy

Plot  Species  Weight
1     A        3.5
1     B        1.2
2     A        2.8
2     B        4.2


  • each row is an observation
  • queries are easier
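
As a minimal sketch of the reshaping above, assuming the table of weights is already loaded into a pandas DataFrame, `melt()` turns the species columns into rows. The variable names `messy`, `tidy`, and `mean_weight` are illustrative only.

```python
import pandas as pd

# The messy table from the slide: one weight column per species.
messy = pd.DataFrame({
    "Plot": [1, 2],
    "SpeciesA": [3.5, 2.8],
    "SpeciesB": [1.2, 4.2],
})

# melt() keeps Plot as an identifier and stacks the species columns into rows.
tidy = messy.melt(id_vars="Plot", var_name="Species", value_name="Weight")

# Drop the "Species" prefix so the values are just "A" and "B".
tidy["Species"] = tidy["Species"].str.replace("Species", "", regex=False)

# Order the rows by plot, then species, to match the tidy table on the slide.
tidy = tidy.sort_values(["Plot", "Species"]).reset_index(drop=True)
print(tidy)

# With one observation per row, a query is a single groupby.
mean_weight = tidy.groupby("Species")["Weight"].mean()
print(mean_weight)
```

Once the data is tidy, questions such as "what is the mean weight per species?" no longer require knowing which columns hold weights.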

Other Qualities of Tidy Data

  • Units not included in cell with data
  • Visual indicators (colors, fonts, italics) not used
  • Consistent names
  • Consistent date formats
  • Short, descriptive language (avoid abstract codes)
  • Use consistent value for missing data (NaN, -9999, blank OK for pandas)
  • Data uniquely assigned to a single table
  • Saved as plain text format (CSV)
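
A few of these qualities can be illustrated with a short pandas sketch. The CSV text, station names, and column names below are hypothetical and made up for illustration. The example assumes -9999 was used as the missing-data code, which has to be declared to pandas explicitly, while blank cells are treated as missing by default.

```python
import io

import pandas as pd

# Hypothetical plain-text CSV (not real NGA LTER data).
# Units live in the column name, not in the cells, and dates use one format.
csv_text = """station,date,temperature_c
A1,2023-07-01,11.2
A1,2023-07-02,-9999
A2,2023-07-01,
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["-9999"],   # declare the missing-data code used in this file
    parse_dates=["date"],  # consistent date strings parse cleanly
)

# Both the -9999 code and the blank cell are now NaN.
n_missing = df["temperature_c"].isna().sum()
print(df.dtypes)
print("missing temperatures:", n_missing)
```

Because the missing-data code is consistent, one `na_values` entry is enough and the temperature column stays numeric.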

Data Carpentry Ecology Lesson Exercise

  1. Work with a partner
  2. Open survey_data_spreadsheet_messy.xlsx in the Google Drive
  3. Identify what is wrong with the spreadsheet
  4. Discuss how you might fix it

After you go through this exercise, we will discuss as a group

Where to Discover Data

How to Discover Data


Scientific Data Discovery                                      Streaming Video
Informally between researchers                                 your mom’s emails
Via project or institutional website                           a link at nbc.com
Referenced in a journal article                                via a blog review
Discoverable within a specialized archive or repository        AppleTV or Netflix
Discoverable in a network of repositories (Data.gov, DataONE)  IMDB

LTER Data Management Requirements

  • Sites must have an integrated Information Management System
  • Data available online within two years of data collection
  • Sites should submit data to repositories
  • Long-term (>20 years) usability of data
  • Metadata

Where is NGA LTER Data?

  • NGA Data Catalog
    • Portal supplied by DataONE
  • Data and metadata are stored in the Research Workspace member node

Data Portals are Powered by Metadata



Data Discovery Using DataONE Activity

Exploring the DataONE Data Catalog